September 21, 2015

What does my group do?

  • Study the molecular basis of variation in development and disease
  • Using high-throughput experimental methods

The genomic revolution

  • For over a decade we have laid the basic molecular blueprint by sequencing DNA

NHGRI strategic plan

[Nature, 2011]

"The major bottleneck in genome sequencing is no longer data generation—the computational challenges around data analysis, display and integration are now rate limiting. New approaches and methods are required to meet these challenges."

  • Data analysis
  • Data integration
  • Visualization
  • Computational tools and infrastructure
[Nature, 2011]

My group's work as a simplex

Computational Epigenomics

What is epigenomics?

What makes them different?

Much human variation is due to differences in ~6 million DNA base pairs (0.1% of the genome)

Genes are expressed differently during different stages and in different tissues.

DNA is packed, making certain parts inaccessible, and this packing is dynamic.

DNA methylation is a chemical modification of DNA, involved in gene expression regulation.

[Robertson and Wolffe, Nat Rev Genet, 2000]

Probing DNA methylation

The data

  • local-likelihood smoothing method
  • high-frequency smoothing estimates local methylation structure (small domains)
  • low-frequency smoothing estimates long-range methylation structure (large domains)
Nature Genetics, 2011; Bioinformatics, 2013
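A minimal sketch of the two smoothing scales on simulated data. It swaps in base R's `loess` (local least-squares regression) for the local-likelihood smoother named above, so it only illustrates the role of the bandwidth. The simulated positions, methylation levels, and span values are all assumptions.

```r
# Simulated CpG positions and noisy methylation levels (illustrative only)
set.seed(1)
n <- 2000
pos <- sort(sample.int(1e6, n))

# True signal: one large low-methylation block plus one small local dip
truth <- rep(0.8, n)
truth[pos > 3e5 & pos < 7e5] <- 0.5        # large domain
truth[pos > 1.0e5 & pos < 1.2e5] <- 0.2    # small domain
meth <- pmin(pmax(truth + rnorm(n, sd = 0.1), 0), 1)

# Narrow span: follows local structure (small domains)
fit_small <- loess(meth ~ pos, span = 0.02, degree = 1)
# Wide span: follows long-range structure (large domains)
fit_large <- loess(meth ~ pos, span = 0.3, degree = 1)

smooth_small <- fitted(fit_small)
smooth_large <- fitted(fit_large)

# Compare the two fits inside the small dip
in_dip <- pos > 1.05e5 & pos < 1.15e5
mean(smooth_small[in_dip])
mean(smooth_large[in_dip])
```

In this simulation the narrow-span fit tracks the small dip, while the wide-span fit averages over it and mainly reflects the surrounding long-range level.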

DNA methylation in cancer

Large blocks of hypo-methylation in colon cancer

Nat. Genetics, 2011
  • overlaps with other important genomic domains
  • genes within these blocks are tissue-specific

Genes with hyper-variable expression in colon cancer are enriched within these blocks.

Nat. Genetics, 2011

Hypo-methylation blocks observed across five solid tumor types.

Genome Medicine, 2014

Gene expression hyper-variability enriched in hypo-methylation blocks in other cancer types.

Genome Medicine, 2014

Genes with consistent hyper-variable expression across tumors are tissue-specific.

BMC Bioinformatics, 2013

Summary

  • large domains of methylation loss are a stable mark across cancer types
  • gene expression hyper-variability is enriched within these domains
  • hyper-variable genes within these regions are tissue-specific and involved in cellular fate

Genes are expressed differently during different stages and in different tissues.

Gene expression anti-profiles

  • molecular methods for cancer detection, prognosis and treatment matching will be the basis of individualized medicine
  • gene expression profile methods have been a subject of study for decades
  • very few proposed predictors are translated to the clinic
  • by far, the biggest culprit is irreproducibility of results in preliminary studies

anti-profile score: measures sample-specific deviation from normal expression in consistently hyper-variable genes

BMC Bioinformatics, 2013

  • Feature selection: top 100 genes with greatest hyper-variable expression in tumor:

\[ \log_2 \frac{\text{std. dev}_{\text{cancer}}}{\text{std. dev}_{\text{normal}}} \]

  • Range of normal expression:

\[ \mathrm{med} \, \text{normal expression}_g \pm 5 \times \mathrm{mad} \, \text{normal expression}_g \]

  • anti-profile score: number of genes in sample where expression is outside normal range
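The three steps above can be sketched in base R on simulated data. The matrices, sample sizes, and the choice of which genes are hyper-variable are assumptions made for illustration; note also that R's `mad` applies its default scaling constant, which may differ from the published method.

```r
# Simulated log-expression matrices: genes in rows, samples in columns
set.seed(42)
n_genes <- 500
normal <- matrix(rnorm(n_genes * 30), nrow = n_genes)
# Tumors: the first 100 genes are made hyper-variable (sd = 4)
tumor_sd <- c(rep(4, 100), rep(1, n_genes - 100))
tumor <- matrix(rnorm(n_genes * 30, sd = tumor_sd), nrow = n_genes)

# Feature selection: top 100 genes by log2 ratio of standard deviations
sd_ratio <- log2(apply(tumor, 1, sd) / apply(normal, 1, sd))
top <- order(sd_ratio, decreasing = TRUE)[1:100]

# Range of normal expression per selected gene: median +/- 5 * MAD
med <- apply(normal[top, ], 1, median)
dev <- apply(normal[top, ], 1, mad)
lower <- med - 5 * dev
upper <- med + 5 * dev

# Anti-profile score: number of selected genes outside the normal range
anti_profile_score <- function(x) {
  sum(x[top] < lower | x[top] > upper)
}

tumor_scores <- apply(tumor, 2, anti_profile_score)
normal_scores <- apply(normal, 2, anti_profile_score)
summary(tumor_scores)
summary(normal_scores)
```

For brevity the sketch scores the same samples used to select genes and set ranges; a real evaluation would score held-out samples.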

Good cross-experiment properties
Stability in normal expression across experiments

BMC Bioinformatics, 2013

Prediction in leave-one-tissue out experiment

BMC Bioinformatics, 2013

Anti-profile score distinguishes between stages in tumor progression

Cancer Informatics, 2015

DNA methylation anti-profile score distinguishes between stages in tumor progression

Cancer Informatics, 2015

Stratification based on anti-profile score

Cancer Informatics, 2015

Stratification of breast samples based on anti-profile score

Cancer Informatics, 2015

Summary

  • Simple counting scheme produces robust, stable and accurate (anti-)profiles
  • Nice prediction properties across experiments and across tissue types
  • Captures increasing hyper-variability associated with progression and prognosis

Stability Analysis via Support Vector Machines

One-class Support Vector Machines

Support Vector Machines for Anomaly Detection: determine if observations belong to a given group or are anomalies.

Anomaly Classification

  • Distinguish observations from two anomalous groups (e.g., adenoma vs. tumor)

  • How can we incorporate the fact that we are classifying anomalies?

  • Why (and when) is it worth doing that?

Anomaly Support Vector Machine

Learn functions in the space spanned by the representers of normal samples

\[ f(x) = \sum_i c_i k(x, z_i) + d \]

where \(z_i\) are normal observations.

Anomaly Support Vector Machine

Estimated, as in a regular SVM, by solving the optimization problem

\[ \min_{c,d} \sum_j (1-y_jf_j)_+ + c'\tilde{K}c \]

with \(f_j = \sum_i c_i k(x_j,z_i) + d\)

and \(\tilde{K}=K_s K_n^{-1} K_s\)

Anomaly Support Vector Machine

  • Using leave-one-out error bounds (via stability arguments)
  • \(K \succeq \tilde{K}\) implies that the LOO error bound of the Anomaly SVM is lower than the LOO error bound of a standard SVM trained pairwise between anomalies
  • Proof uses arguments based on the SVM path algorithm, and mild conditions
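The ordering \(K \succeq \tilde{K}\) can be checked numerically. The sketch below is illustrative only: the Gaussian kernel, its `gamma`, and the simulated samples are assumptions, and it reads \(K_s\) as the anomalies-by-normals kernel matrix, so the second factor is transposed to make the product conformable.

```r
# Gaussian (RBF) kernel matrix between the rows of a and the rows of b
rbf_kernel <- function(a, b, gamma = 1) {
  sq_dist <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * a %*% t(b)
  exp(-gamma * pmax(sq_dist, 0))
}

set.seed(7)
z <- matrix(rnorm(10 * 2), ncol = 2)            # normal samples
x <- matrix(rnorm(15 * 2, mean = 2), ncol = 2)  # anomalous samples

K   <- rbf_kernel(x, x)   # anomalies vs. anomalies
K_s <- rbf_kernel(x, z)   # anomalies vs. normals
K_n <- rbf_kernel(z, z)   # normals vs. normals

# Indirect similarity between anomalies, routed through the normal samples
K_tilde <- K_s %*% solve(K_n, t(K_s))

# K - K_tilde is a Schur complement of a positive semi-definite matrix,
# so its eigenvalues should be non-negative up to rounding error
min_eig <- min(eigen(K - K_tilde, symmetric = TRUE, only.values = TRUE)$values)
min_eig
```

The check relies on the joint kernel matrix over anomalies and normals being positive semi-definite with \(K_n\) invertible, which holds for a Gaussian kernel on distinct points.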

Stability and accuracy of Anomaly SVM

Prediction of high vs. low relapse risk in lung cancer

Prediction of suspect vs. pathological fetal CTG data (not genomics)

Summary

  • Profiles learned based on hyper-variability show consistent behavior across tissues and across experiments in tumor prognosis and progression
  • We can extend the general anti-profile idea to a function approximation setting
  • Use sensitivity-based cross-validation error bounds to characterize the effect of incorporating normal observations when classifying between anomalies
  • Indirect similarity through normal samples improves stability while improving prediction performance

This is a heterogeneous cell population

Methylation pattern reconstruction problem

  • Given a set of mapped reads
  • Determine composition of cell-specific methylation patterns

Methylation pattern reconstruction problem

\[ \mathbb{E} y_v = \sum_{u:(v,u) \in E} \ell_{vu} \sum_{p:(v,u)\in p} \theta_p \]

  • Penalized method of moments:
  • number of parameters = number of paths through graph
  • sparsity inducing penalty to obtain solution with small number of patterns

\[ \min_{\theta_p} \sum_v |y_v - \sum_{u:(v,u)\in E} \ell_{vu} \sum_{p:(v,u)\in p} \theta_p | + \lambda \sum_p |\theta_p | \]
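One way to read the penalized objective above is as a linear program, using the standard reformulation of absolute-value terms. This sketch is not taken from the slides: the matrix \(A\), the slack variables \(t_v\), and the non-negativity constraint on \(\theta_p\) (natural if each \(\theta_p\) is a pattern abundance) are assumptions introduced for illustration.

\[ A_{vp} = \sum_{u:(v,u)\in E,\ (v,u)\in p} \ell_{vu}, \qquad \mathbb{E} y_v = \sum_p A_{vp} \theta_p \]

\[ \min_{\theta, t} \sum_v t_v + \lambda \sum_p \theta_p \quad \text{subject to} \quad -t_v \le y_v - \sum_p A_{vp} \theta_p \le t_v, \quad \theta_p \ge 0 \]

At the optimum each \(t_v\) equals the absolute residual at \(v\), so under the non-negativity assumption the two problems have the same minimizers, and the second can be handed to any LP solver.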

Moving forward

  • move anti-profiles closer to the clinic
  • explore anomaly classification as a general learning setting
  • methods to understand hierarchical organization of epigenomic domains
  • better understand connection between intra-tumor heterogeneity and consistent hyper-variability in cancer

  • Discoveries: consistent hypo-methylation, hyper-variability
  • Methods: anomaly classification as a setting to understand predictor stability
  • Tools

Tools

  • State-of-the-art computational and statistical analysis platform
  • We develop and apply methods for these analyses in this platform
  • Our collaborators do analysis in this platform with us
  • antiProfiles
  • minfi
  • bumphunter
  • Rcplex
  • Rcsdp
  • HTShape
  • qsmooth

Collaborative and exploratory analysis

  • Data transformation and modeling: data smoothing, region finding (R/Bioconductor: Bsmooth, minfi)
  • Genome browsing: search by gene, search by overlap
  • Region analysis: overlap with other data (our own, other labs, UCSC, Ensembl)
  • Regulation: expression data (Gene Expression Barcode)

Genomic Data Science!

  • We have unprecedented ability to measure
  • and lots of publicly available data to contextualize it
[H. Wickham]

Integrative, visual and computational exploratory analysis of genomic data

  • Browser-based
  • Interactive
  • Integration of data
  • Reproducible dissemination
  • Communication with R/Bioconductor: epivizr package
e.g.: http://epiviz.cbcb.umd.edu/?ws=YOsu0RmUc9l
[Nat. Methods, 2014]

Creativity in exploration

We are building software systems to support creative exploratory analysis of large genome-wide datasets…

[T. Speed]

Computed Measurements: create new measurements from integrated measurements and visualize

Summarization: summarize integrated measurements (computed on data subsets)

Statistically-guided exploration: Calculate a statistic of interest

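library(SummarizedExperiment)  # assay(), rowRanges()
library(matrixStats)           # rowSds()

# Get tumor methylation base-pair data
m <- assay(se)[,"tumor"]

# Compute regions with highest variability across CpGs
# (calcWindowStat is a helper that is not defined on these slides)
region_stat <- calcWindowStat(m, step=25, window=80, stat=rowSds)
s <- region_stat[,"stat"]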

Explore data based on statistic

What's around the regions with the highest across-CpG variability?

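# Get window locations in decreasing order of the statistic
o <- order(s, decreasing=TRUE)
indices <- region_stat[o, "indices"]
# Widen each region by 1.25 Mb on both sides for context
slideShowRegions <- rowRanges(se)[indices] + 1250000L
# mgr is the epivizr session manager
mgr$slideshow(slideShowRegions)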
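Since `calcWindowStat` is not defined on the slides, the following base R sketch shows one way a sliding-window statistic of this kind could be computed. `window_stat` is a hypothetical stand-in, not the helper used above; its argument semantics (window width and step counted in consecutive values) and the simulated data are assumptions.

```r
# Hypothetical sliding-window helper (not the calcWindowStat used above):
# applies `stat` to windows of `window` consecutive values, moving `step` at a time
window_stat <- function(x, step = 25, window = 80, stat = sd) {
  starts <- seq(1, length(x) - window + 1, by = step)
  data.frame(
    indices = starts,
    stat = vapply(starts, function(i) stat(x[i:(i + window - 1)]), numeric(1))
  )
}

# Simulated methylation values with one highly variable stretch (501 to 600)
set.seed(3)
m_sim <- c(rnorm(500, 0.7, 0.02), rnorm(100, 0.5, 0.2), rnorm(500, 0.7, 0.02))

region_stat_sim <- window_stat(m_sim)

# Rank windows by variability, highest first
o_sim <- order(region_stat_sim$stat, decreasing = TRUE)
head(region_stat_sim[o_sim, ])
```

The top-ranked windows are the ones a slideshow would visit first.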

Dynamically extensible: easily integrate new data types and add new visualizations.

  • Based on classic "three-table" design in genomic data analysis
  • Data providers define coordinate space
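A minimal sketch of the "three-table" idea in base R, assuming it refers to a measurement matrix aligned with a feature table (which carries the coordinates) and a sample table, as in Bioconductor containers. All names and values are made up for illustration.

```r
# Three aligned tables: measurements, feature annotation, sample annotation
measurements <- matrix(
  c(0.9, 0.8, 0.2,    # column "normal"
    0.7, 0.6, 0.1),   # column "tumor"
  nrow = 3,
  dimnames = list(c("cpg1", "cpg2", "cpg3"), c("normal", "tumor"))
)
features <- data.frame(
  chr = c("chr1", "chr1", "chr2"),
  start = c(1000L, 1500L, 300L),
  row.names = rownames(measurements)
)
samples <- data.frame(
  status = c("normal", "tumor"),
  row.names = colnames(measurements)
)

# The tables stay consistent because rows and columns share names
stopifnot(identical(rownames(features), rownames(measurements)),
          identical(rownames(samples), colnames(measurements)))

# Subset by a feature property and a sample property at the same time
keep_features <- rownames(features)[features$chr == "chr1"]
keep_samples <- rownames(samples)[samples$status == "tumor"]
subset_m <- measurements[keep_features, keep_samples, drop = FALSE]
subset_m
```

Keeping coordinates in the feature table is what lets different data sources be aligned on a shared coordinate space.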

[H. Wickham]

One interpretation of Big Data: many relevant sources of contextual data

  • Easily access/integrate contextual data
  • Driven by exploratory analysis of immediate data
  • Iterative process
  • Visual and computational exploration go hand in hand

Visualization goals

  • Context
  • Integrate and align multiple data sources; navigate; search
  • Connect: brushing
  • Encode: map visualization properties to data on the fly
  • Reconfigure: multiple views of the same data
[Perer & Shneiderman]

Visualization goals

  • Data
  • Select and filter: tight-knit integration with R/Bioconductor;
  • (current work) filters on visualization propagate to data environment
  • Model
  • New 'measurements' as the result of modeling; perhaps suggested by data context
[Perer & Shneiderman]

Moving forward

  • collaborative computational and visual analysis (w/ N. Elmqvist @ HCIL)
  • effective visual methods to explore hierarchical organization of genome
  • deeper integration of statistically-informed visualization
  • visualization-informed statistical analysis

  • Discoveries: consistent hypo-methylation, hyper-variability
  • Methods: anomaly classification as a setting to understand predictor stability
  • Tools: computational and visual exploratory genomic data analysis

Metagenomics (mixed genomes)

[Human Microbiome Project]

Metagenomics (mixed genomes)

  • Discoveries: pathogenic associations for childhood diarrhea in developing world. (Genome Biology, 2014)
  • Methods: association discovery for metagenomic communities. (Nature Methods, 2013)
  • Tools: metagenomeSeq, metagenomicFeatures, metaviz

Coordinates:

Samples:

Hierarchically organized features

NHGRI strategic plan

"Meeting the computational challenges for genomics requires scientists with expertise in biology as well as in informatics, computer science, mathematics, statistics and/or engineering."

A new generation of investigators who are proficient in two or more of these fields must be trained and supported.

Acknowledgements

Past members of HCBravo group
now at Harvard, U. Chicago, Johns Hopkins, Genentech, Dow Jones Data Science

Colleagues at CBCB
Current members of HCBravo group
Collaborators at JHU/Harvard

Funding: NIH, Genentech, Gates Foundation

More information

http://hcbravo.org
@hcorrada